{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 进阶篇：pandas操作技巧系列（一）"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 操作技巧1：变量类型自动转换\n",
    "在用pandas进行数据清洗的过程中，变量的类型转换是一个必然会遇到的步骤。清洗初期查看dtypes经常出现object类型，但其实变量本身可能就是个字符串，或者是数字（但因存在空值，导致出现了object类型）。\n",
    "\n",
    "通常大家所熟知的方法是使用astype进行类型转换，或者自己利用astype造个轮子，写个函数方法实现自动转换类型。\n",
    "\n",
    "本次东哥介绍一个pandas里可实现自动转换变量类型的方法convert_dtypes。利用它可以一次性全部转换为最理想的类型。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 使用方法\n",
    "默认情况下，convert_dtypes将尝试将Series或DataFrame中的每个Series转换为支持的dtypes。它可以对Series和DataFrame都直接使用。\n",
    "\n",
    "这个方法的参数如下。\n",
    "```python\n",
    "# 是否应将对象dtypes转换为最佳类型\n",
    "infer_objects bool，默认为True\n",
    "\n",
    "# 对象dtype是否应转换为StringDtype()\n",
    "convert_string bool，默认为True\n",
    "\n",
    "# 如果可能，是否可以转换为整数扩展类型\n",
    "convert_integer bool，默认为True\n",
    "\n",
    "# 对象dtype是否应转换为BooleanDtypes()\n",
    "convert_boolean bool，默认为True\n",
    "\n",
    "# 如果可能，是否可以转换为浮动扩展类型。\n",
    "# 如果convert_integer也为True，则如果可以将浮点数忠实地转换为整数，则将优先考虑整数dtype\n",
    "convert_floating bool，默认为True\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 实例\n",
    "下面看个例子。\n",
    "\n",
    "首先创建一组数据，通过dtype规定每个变量的类型。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(\n",
    "    {\n",
    "        \"a\": pd.Series([1, 2, 3], dtype=np.dtype(\"int32\")),\n",
    "        \"b\": pd.Series([\"x\", \"y\", \"z\"], dtype=np.dtype(\"O\")),\n",
    "        \"c\": pd.Series([True, False, np.nan], dtype=np.dtype(\"O\")),\n",
    "        \"d\": pd.Series([\"h\", \"i\", np.nan], dtype=np.dtype(\"O\")),\n",
    "        \"e\": pd.Series([10, np.nan, 20], dtype=np.dtype(\"float\")),\n",
    "        \"f\": pd.Series([np.nan, 100.5, 200], dtype=np.dtype(\"float\")),\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### DataFrame 变量类型转换"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "      <th>d</th>\n",
       "      <th>e</th>\n",
       "      <th>f</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>x</td>\n",
       "      <td>True</td>\n",
       "      <td>h</td>\n",
       "      <td>10.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>y</td>\n",
       "      <td>False</td>\n",
       "      <td>i</td>\n",
       "      <td>NaN</td>\n",
       "      <td>100.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>z</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>20.0</td>\n",
       "      <td>200.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   a  b      c    d     e      f\n",
       "0  1  x   True    h  10.0    NaN\n",
       "1  2  y  False    i   NaN  100.5\n",
       "2  3  z    NaN  NaN  20.0  200.0"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a      int32\n",
       "b     object\n",
       "c     object\n",
       "d     object\n",
       "e    float64\n",
       "f    float64\n",
       "dtype: object"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "通过结果可以看到，变量都是是创建时默认的类型。但其实变量是有整数、字符串、布尔的，其中有的还存在空值。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "      <th>d</th>\n",
       "      <th>e</th>\n",
       "      <th>f</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>x</td>\n",
       "      <td>True</td>\n",
       "      <td>h</td>\n",
       "      <td>10</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>y</td>\n",
       "      <td>False</td>\n",
       "      <td>i</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>100.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>z</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>20</td>\n",
       "      <td>200.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   a  b      c     d     e      f\n",
       "0  1  x   True     h    10   <NA>\n",
       "1  2  y  False     i  <NA>  100.5\n",
       "2  3  z   <NA>  <NA>    20  200.0"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dfn = df.convert_dtypes()\n",
    "dfn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下面使用convert_dtypes进行转换。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a      Int32\n",
       "b     string\n",
       "c    boolean\n",
       "d     string\n",
       "e      Int64\n",
       "f    Float64\n",
       "dtype: object"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dfn.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "变量类型已经转换为我们想要的了。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Series 变量类型转换\n",
    "对Series的转换也是一样的。下面的Seires中由于存在nan空值所以类型为object。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0      a\n",
       "1      b\n",
       "2    NaN\n",
       "dtype: object"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = pd.Series([\"a\", \"b\", np.nan])\n",
    "s"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "然后我们通过convert_dtypes成功转换为String。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0       a\n",
       "1       b\n",
       "2    <NA>\n",
       "dtype: string"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s.convert_dtypes()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果未来增加了新类型，convert_dtypes方法也会同步更新，并支持新的变量类型。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 操作技巧2：json数据解析"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "用API和文档数据库会返回嵌套的JSON对象，当我们使用Python尝试将嵌套结构中的键转换为列时，数据加载到pandas中往往会得到如下结果：\n",
    "\n",
    "> df = pd.DataFrame.from_records（results [“ issues”]，columns = [“ key”，“ fields”]）\n",
    "\n",
    "![](https://camo.githubusercontent.com/348a155256bc6ae9e19dbb0a8d7da62be2d4ab6e49b9c6b4bbb5c895767760c7/68747470733a2f2f6d6d62697a2e717069632e636e2f737a5f6d6d62697a5f706e672f4e4f4d35484e326963587a787571636b5547667646347a4751345a31466f664b5747654f3256427a4a564a6f754b42524966326c4e576655314d316963597a78377a584c4e4474616961744a5953384166486962547836696146512f3634303f77785f666d743d706e672674703d7765627026777866726f6d3d352677785f6c617a793d312677785f636f3d31)\n",
    "\n",
    "说明：这里results是一个大的字典，issues是results其中的一个键，issues的值为一个嵌套JSON对象字典的列表，后面会看到JSON嵌套结构。\n",
    "\n",
    "问题在于API返回了嵌套的JSON结构，而我们关心的键在对象中确处于不同级别。\n",
    "\n",
    "嵌套的JSON结构张成这样的。\n",
    "![](https://camo.githubusercontent.com/ff92a9d6b9d8f01007eb116940b6091a924ded78a6ce7429726d3f6700a51066/68747470733a2f2f6d6d62697a2e717069632e636e2f737a5f6d6d62697a5f706e672f4e4f4d35484e326963587a787571636b5547667646347a4751345a31466f664b5747426d36644a57427a48644b664e456963696357343347496839477a346652314335485037535563784270364d696148493969624e6962425732772f3634303f77785f666d743d706e672674703d7765627026777866726f6d3d352677785f6c617a793d312677785f636f3d31)\n",
    "\n",
    "而我们想要的是下面这样的。\n",
    "\n",
    "![](https://mmbiz.qpic.cn/sz_mmbiz_png/NOM5HN2icXzxuqckUGfvF4zGQ4Z1FofKWGeO2VBzJVJouKBRIf2lNWfU1M1icYzx7zXLNDtaiatJYS8AfHibTx6iaFQ/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1)\n",
    "\n",
    "下面以一个API返回的数据为例，API通常包含有关字段的元数据。假设下面这些是我们想要的字段。\n",
    "\n",
    "- key：JSON密钥，在第一级的位置。\n",
    "- summary：第二级的“字段”对象。\n",
    "- status name：第三级位置。\n",
    "- statusCategory name：位于第4个嵌套级别。\n",
    "\n",
    "如上，我们选择要提取的字段在issues列表内的JSON结构中分别处于4个不同的嵌套级别，一环扣一环。\n",
    "```json\n",
    "{\n",
    "  \"expand\": \"schema,names\",\n",
    "  \"issues\": [\n",
    "    {\n",
    "      \"fields\": {\n",
    "        \"issuetype\": {\n",
    "          \"avatarId\": 10300,\n",
    "          \"description\": \"\",\n",
    "          \"id\": \"10005\",\n",
    "          \"name\": \"New Feature\",\n",
    "          \"subtask\": False\n",
    "        },\n",
    "        \"status\": {\n",
    "          \"description\": \"A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.\",\n",
    "          \"id\": \"5\",\n",
    "          \"name\": \"Resolved\",\n",
    "          \"statusCategory\": {\n",
    "            \"colorName\": \"green\",\n",
    "            \"id\": 3,\n",
    "            \"key\": \"done\",\n",
    "            \"name\": \"Done\",\n",
    "          }\n",
    "        },\n",
    "        \"summary\": \"Recovered data collection Defraglar $MFT problem\"\n",
    "      },\n",
    "      \"id\": \"11861\",\n",
    "      \"key\": \"CAE-160\",\n",
    "    },\n",
    "    {\n",
    "      \"fields\": { \n",
    "... more issues],\n",
    "  \"maxResults\": 5,\n",
    "  \"startAt\": 0,\n",
    "  \"total\": 160\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 解决方案\n",
    "pandas中有一个牛逼的内置功能叫 .json_normalize。\n",
    "\n",
    "pandas的文档中提到：将半结构化JSON数据规范化为平面表。\n",
    "\n",
    "前面方案的所有代码，用这个内置功能仅需要3行就可搞定。步骤很简单，懂了下面几个用法即可。\n",
    "\n",
    "确定我们要想的字段，使用 . 符号连接嵌套对象。\n",
    "\n",
    "将想要处理的嵌套列表（这里是results[\"issues\"]）作为参数放进 .json_normalize 中。\n",
    "\n",
    "过滤我们定义的FIELDS列表。\n",
    "```python\n",
    "FIELDS = [\"key\", \"fields.summary\", \"fields.issuetype.name\", \"fields.status.name\", \"fields.status.statusCategory.name\"]\n",
    "df = pd.json_normalize(results[\"issues\"])\n",
    "df[FIELDS]\n",
    "```\n",
    "![](https://camo.githubusercontent.com/557bc7e474fb9eee2d9b71c6f5899f07deca1f0698d961042d7751a9453cbb02/68747470733a2f2f6d6d62697a2e717069632e636e2f737a5f6d6d62697a5f706e672f4e4f4d35484e326963587a7875536f787841626b77566d7574704356754f4d43486c6962624e3351336563725458727a39764c69623476616c41456141537a345a76424669624c654244626b6a65436e494a47745533363170412f3634303f77785f666d743d706e672674703d7765627026777866726f6d3d352677785f6c617a793d312677785f636f3d31)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 其它操作\n",
    "#### 记录路径\n",
    "除了像上面那样传递results[\"issues\"]列表之外，我们还使用record_path参数在JSON对象中指定列表的路径。\n",
    "\n",
    "```oython\n",
    "# 使用路径而不是直接用results[\"issues\"]\n",
    "pd.json_normalize(results, record_path=\"issues\")[FIELDS]\n",
    "```\n",
    "\n",
    "#### 自定义分隔符\n",
    "还可以使用sep参数自定义嵌套结构连接的分隔符，比如下面将默认的“.”替换“-”。\n",
    "\n",
    "```python\n",
    "### 用 \"-\" 替换默认的 \".\"\n",
    "FIELDS = [\"key\", \"fields-summary\", \"fields-issuetype-name\", \"fields-status-name\", \"fields-status-statusCategory-name\"]\n",
    "pd.json_normalize(results[\"issues\"], sep = \"-\")[FIELDS]\n",
    "```\n",
    "\n",
    "#### 控制递归\n",
    "如果不想递归到每个子对象，可以使用max_level参数控制深度。在这种情况下，由于statusCategory.name字段位于JSON对象的第4级，因此不会包含在结果DataFrame中。\n",
    "```python\n",
    "#只深入到嵌套第二级 \n",
    "pd.json_normalize(results, record_path=\"issues\", max_level = 2)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 操作技巧3：Datetime时间类型提速\n",
    "平时我们运行pandas少不了和时间打交道，而大多情况下许多朋友都是暴力解决问题，直接让pandas自己转换和处理。\n",
    "\n",
    "对于平时的学习和小测试是没什么问题的，但当跑一些大数据的时候往往会非常的慢，而这个时间性能其实是完全可以优化的。\n",
    "\n",
    "本次东哥介绍一个非常简单的操作，使用Datetime变换时间类型，让你的代码运行速度飞速提升。\n",
    "\n",
    "下面，我们来看一个例子。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date_time</th>\n",
       "      <th>energy_kwh</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1/1/13 0:00</td>\n",
       "      <td>0.586</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1/1/13 1:00</td>\n",
       "      <td>0.580</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1/1/13 2:00</td>\n",
       "      <td>0.572</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1/1/13 3:00</td>\n",
       "      <td>0.596</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1/1/13 4:00</td>\n",
       "      <td>0.592</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     date_time  energy_kwh\n",
       "0  1/1/13 0:00       0.586\n",
       "1  1/1/13 1:00       0.580\n",
       "2  1/1/13 2:00       0.572\n",
       "3  1/1/13 3:00       0.596\n",
       "4  1/1/13 4:00       0.592"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv('data/demand_profile.csv')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "从运行上面代码得到的结果来看，好像没有什么问题。但实际上pandas和numpy都有一个 dtypes 的概念。如果没有特殊声明，那么date_time将会使用一个 object 的 dtype 类型，如下面代码所示："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "date_time      object\n",
       "energy_kwh    float64\n",
       "dtype: object"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "str"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(df.iat[0, 0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "object 类型像一个大的容器，不仅仅可以承载 str，也可以包含那些不能很好地融进一个数据类型的任何特征列。而如果我们将日期作为 str 类型就会极大的影响效率。\n",
    "\n",
    "因此，对于时间序列的数据而言，我们需要让上面的date_time列格式化为datetime对象数组（pandas称之为时间戳）。pandas在这里操作非常简单，操作如下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dtype('<M8[ns]')"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['date_time'] = pd.to_datetime(df['date_time'])\n",
    "df['date_time'].dtype"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们来运行一下这个df看看转化后的效果是什么样的。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date_time</th>\n",
       "      <th>energy_kwh</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2013-01-01 00:00:00</td>\n",
       "      <td>0.586</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2013-01-01 01:00:00</td>\n",
       "      <td>0.580</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2013-01-01 02:00:00</td>\n",
       "      <td>0.572</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2013-01-01 03:00:00</td>\n",
       "      <td>0.596</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2013-01-01 04:00:00</td>\n",
       "      <td>0.592</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            date_time  energy_kwh\n",
       "0 2013-01-01 00:00:00       0.586\n",
       "1 2013-01-01 01:00:00       0.580\n",
       "2 2013-01-01 02:00:00       0.572\n",
       "3 2013-01-01 03:00:00       0.596\n",
       "4 2013-01-01 04:00:00       0.592"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "date_time的格式已经自动转化了，但这还没完，在这个基础上，我们还是可以继续提高运行速度的。如何提速呢？为了更好的对比，我们首先通过 timeit 装饰器来测试一下上面代码的转化时间。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from zyf_timer import repeat_timeit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Function convert -> 3 trials 10 function calls: average trial 11.169 seconds, average function call 1.117 seconds\n"
     ]
    }
   ],
   "source": [
    "@repeat_timeit(repeat=3, number=10)\n",
    "def convert(df, column_name):\n",
    "    return pd.to_datetime(df[column_name])\n",
    "\n",
    "df['date_time'] = convert(df, 'date_time')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1.117s，看上去挺快，但其实可以更快，我们来看一下下面的方法。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Function convert -> 3 trials 10 function calls: average trial 0.158 seconds, average function call 0.016 seconds\n"
     ]
    }
   ],
   "source": [
    "@repeat_timeit(repeat=3, number=100)\n",
    "def convert_with_format(df, column_name):\n",
    "    return pd.to_datetime(df[column_name],format='%d/%m/%y %H:%M')\n",
    "\n",
    "df['date_time'] = convert(df, 'date_time')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "结果只有0.016s，快了将近70倍。\n",
    "\n",
    "原因是：我们设置了转化的格式format。由于在CSV中的datetimes并不是 ISO 8601 格式的，如果不进行设置的话，那么pandas将使用 dateutil 包把每个字符串str转化成date日期。\n",
    "\n",
    "相反，如果原始数据datetime已经是 ISO 8601 格式了，那么pandas就可以立即使用最快速的方法来解析日期。这也就是为什么提前设置好格式format可以提升这么多。\n",
    "\n",
    "当然，这个只是在时间处理上的一个提速小操作。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
